Sales prediction model for State of Connecticut Real Estate Sales 2001-2022 GL¶
The Office of Policy and Management maintains a listing of all real estate sales with a sales price of $2,000 or greater that occur between October 1 and September 30 of each year. For each sale record, the file includes: town, property address, date of sale, property type (residential, apartment, commercial, industrial or vacant land), sales price, and property assessment.
Source Data : https://catalog.data.gov/dataset/real-estate-sales-2001-2018
Return Home : https://johnkimaiyo.vercel.app/
Creating a prediction model using Python and Pandas involves several steps, including data preprocessing, exploratory data analysis, feature engineering, model selection, training, and evaluation.
Step 1: Import Necessary Libraries¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import joblib
Step 2: Load the Dataset¶
Real_Estate_Sales_df = pd.read_csv(r"C:\Users\jki\Desktop\Data Scence Projects\Sate of Connecticut Real Etstae Sales\Machine Learning\Source Data\Real_Estate_Sales_2001-2022_GL.csv")
# Display the first few rows of the dataset
print(Real_Estate_Sales_df.head())
C:\Users\jki\AppData\Local\Temp\ipykernel_20024\2726286131.py:1: DtypeWarning: Columns (8,9,10,11,12) have mixed types. Specify dtype option on import or set low_memory=False. Real_Estate_Sales_df = pd.read_csv(r"C:\Users\jki\Desktop\Data Scence Projects\Sate of Connecticut Real Etstae Sales\Machine Learning\Source Data\Real_Estate_Sales_2001-2022_GL.csv")
   Serial Number  List Year Date Recorded     Town           Address  \
0         220008       2022    01/30/2023  Andover       618 ROUTE 6   
1        2020348       2020    09/13/2021  Ansonia   230 WAKELEE AVE   
2          20002       2020    10/02/2020  Ashford   390 TURNPIKE RD   
3         210317       2021    07/05/2022     Avon   53 COTSWOLD WAY   
4         200212       2020    03/09/2021     Avon  5 CHESTNUT DRIVE   
   Assessed Value  Sale Amount  Sales Ratio Property Type Residential Type  \
0        139020.0     232000.0       0.5992   Residential    Single Family   
1        150500.0     325000.0       0.4630    Commercial              NaN   
2        253000.0     430000.0       0.5883   Residential    Single Family   
3        329730.0     805000.0       0.4096   Residential    Single Family   
4        130400.0     179900.0       0.7248   Residential            Condo   
  Non Use Code Assessor Remarks OPM remarks  \
0          NaN              NaN         NaN   
1          NaN              NaN         NaN   
2          NaN              NaN         NaN   
3          NaN              NaN         NaN   
4          NaN              NaN         NaN   
                             Location  
0  POINT (-72.343628962 41.728431984)  
1                                 NaN  
2                                 NaN  
3  POINT (-72.846365959 41.781677018)  
4                                 NaN  
Step 3: Data Preprocessing¶
Before building the model, you need to preprocess the data. This includes handling missing values, converting data types, and encoding categorical variables if necessary.
# lets check for  missing values
missing_values =  Real_Estate_Sales_df.isna().sum()
print(missing_values)
Serial Number 0 List Year 0 Date Recorded 2 Town 0 Address 51 Assessed Value 0 Sale Amount 0 Sales Ratio 0 Property Type 382446 Residential Type 398389 Non Use Code 784178 Assessor Remarks 926401 OPM remarks 1084598 Location 799518 dtype: int64
# lets remove missing values
Real_Estate_Sales_df.dropna(subset=['Date Recorded','Address','Property Type','Residential Type','Non Use Code','Assessor Remarks','OPM remarks','Location'],inplace =True)
# lets confirm existance of missing values
missing_values = Real_Estate_Sales_df.isna().sum()
print(missing_values)
Serial Number 0 List Year 0 Date Recorded 0 Town 0 Address 0 Assessed Value 0 Sale Amount 0 Sales Ratio 0 Property Type 0 Residential Type 0 Non Use Code 0 Assessor Remarks 0 OPM remarks 0 Location 0 dtype: int64
# let check if we have unwanted negative values
Real_Estate_Sales_df.describe()
| Serial Number | List Year | Assessed Value | Sale Amount | Sales Ratio | |
|---|---|---|---|---|---|
| count | 1.097629e+06 | 1.097629e+06 | 1.097629e+06 | 1.097629e+06 | 1.097629e+06 | 
| mean | 5.370357e+05 | 2.011218e+03 | 2.818016e+05 | 4.053146e+05 | 9.603926e+00 | 
| std | 7.526074e+06 | 6.773485e+00 | 1.657890e+06 | 5.143492e+06 | 1.801664e+03 | 
| min | 0.000000e+00 | 2.001000e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 
| 25% | 3.071300e+04 | 2.005000e+03 | 8.909000e+04 | 1.450000e+05 | 4.778667e-01 | 
| 50% | 8.070600e+04 | 2.011000e+03 | 1.405800e+05 | 2.330000e+05 | 6.105663e-01 | 
| 75% | 1.703410e+05 | 2.018000e+03 | 2.282700e+05 | 3.750000e+05 | 7.707200e-01 | 
| max | 2.000500e+09 | 2.022000e+03 | 8.815100e+08 | 5.000000e+09 | 1.226420e+06 | 
# lets check for data types to perform numerical calculations
Real_Estate_Sales_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 560 entries, 527 to 1093559 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Serial Number 560 non-null int64 1 List Year 560 non-null int64 2 Town 560 non-null object 3 Address 560 non-null object 4 Assessed Value 560 non-null float64 5 Sale Amount 560 non-null float64 6 Sales Ratio 560 non-null float64 7 Year 560 non-null int32 8 Month 560 non-null int32 9 Day 560 non-null int32 dtypes: float64(3), int32(3), int64(2), object(2) memory usage: 41.6+ KB
print(Real_Estate_Sales_df.columns)
Index(['Serial Number', 'List Year', 'Town', 'Address', 'Assessed Value',
       'Sale Amount', 'Sales Ratio', 'Year', 'Month', 'Day'],
      dtype='object')
# lets drop unwanted columns 
# Columns to drop
columns_to_drop = ['Property Type', 'Residential Type','Non Use Code','Assessor Remarks','OPM remarks','Location','Address','Town','Serial Number','List Year']
Real_Estate_Sales_df.drop(columns=columns_to_drop, inplace=True,errors='ignore')
Real_Estate_Sales_df.head(5)
| Assessed Value | Sale Amount | Sales Ratio | Year | Month | Day | |
|---|---|---|---|---|---|---|
| 527 | 207700.0 | 400000.0 | 0.5193 | 2023 | 9 | 20 | 
| 922 | 49700.0 | 29900.0 | 1.6622 | 2023 | 2 | 27 | 
| 940 | 208230.0 | 423261.0 | 0.4920 | 2022 | 10 | 31 | 
| 1345 | 169050.0 | 351000.0 | 0.4816 | 2023 | 3 | 30 | 
| 1632 | 769900.0 | 3500000.0 | 0.2200 | 2023 | 4 | 26 | 
Step 4: Exploratory Data Analysis (EDA)¶
# Summary statistics
print(Real_Estate_Sales_df.describe())
# Correlation matrix
print(Real_Estate_Sales_df.corr())
# Plotting the correlation matrix
import seaborn as sns
sns.heatmap(Real_Estate_Sales_df.corr(), annot=True, cmap='coolwarm')
plt.show()
       Assessed Value   Sale Amount  Sales Ratio         Year       Month  \
count    5.600000e+02  5.600000e+02   560.000000   560.000000  560.000000   
mean     1.701261e+05  9.091623e+05     0.838064  2020.926786    6.951786   
std      1.669984e+05  1.346897e+07     1.091531     1.850397    3.215486   
min      0.000000e+00  4.500000e+03     0.000000  2017.000000    1.000000   
25%      9.025750e+04  1.100000e+05     0.413374  2020.000000    4.000000   
50%      1.416000e+05  2.140000e+05     0.625957  2021.000000    7.000000   
75%      1.992075e+05  3.752500e+05     1.020047  2022.000000   10.000000   
max      2.083410e+06  3.187900e+08    20.020000  2023.000000   12.000000   
              Day  
count  560.000000  
mean    15.817857  
std      9.149467  
min      1.000000  
25%      8.000000  
50%     17.000000  
75%     23.000000  
max     31.000000  
                Assessed Value  Sale Amount  Sales Ratio      Year     Month  \
Assessed Value        1.000000     0.048048    -0.046309  0.017759 -0.014153   
Sale Amount           0.048048     1.000000    -0.040326  0.004436 -0.000965   
Sales Ratio          -0.046309    -0.040326     1.000000 -0.008175  0.013997   
Year                  0.017759     0.004436    -0.008175  1.000000 -0.140702   
Month                -0.014153    -0.000965     0.013997 -0.140702  1.000000   
Day                  -0.026584    -0.045498     0.056925  0.012419  0.000674   
                     Day  
Assessed Value -0.026584  
Sale Amount    -0.045498  
Sales Ratio     0.056925  
Year            0.012419  
Month           0.000674  
Day             1.000000  
Step 5: Feature Engineering¶
Feature engineering involves creating new features or transforming existing ones to improve the model's performance.
# Display the first few rows after feature engineering
print(Real_Estate_Sales_df.head())
Assessed Value Sale Amount Sales Ratio Year Month Day 527 207700.0 400000.0 0.5193 2023 9 20 922 49700.0 29900.0 1.6622 2023 2 27 940 208230.0 423261.0 0.4920 2022 10 31 1345 169050.0 351000.0 0.4816 2023 3 30 1632 769900.0 3500000.0 0.2200 2023 4 26
Step 6: Splitting the Data¶
Split the data into training and testing sets
# Define features (X) and target (y)
X = Real_Estate_Sales_df.drop(['Sale Amount'], axis=1)
y = Real_Estate_Sales_df['Sale Amount']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
(448, 5) (112, 5)
Step 7: Model Selection and Training¶
# Initialize the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
Step 8: Model Evaluation¶
Evaluate the model's performance on the test data.
# Make predictions
y_pred = model.predict(X_test)
# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# Plot the actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()
Mean Squared Error: 1358146224171.0066
Step 9: Making Predictions¶
You can now use the trained model to make predictions on new data.
# Example: Predict on new data
new_data = pd.DataFrame({
   
    'Assessed Value': [20770],
    'Sales Ratio': [0.5],
    'Year': [2026],
    'Month': [5],
    'Day': [15],
    
})
# Save the model to a file
joblib.dump(model, 'real_estate_sales_model.pkl')
predicted_sales = model.predict(new_data)
print(f'Predicted Total Sales: {predicted_sales[0]}')
Predicted Total Sales: 872499.7062873021